library(readr)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(leaflet)
library(geojsonio)
library(htmltools)
library(htmlwidgets)

Colombia COVID-19

LINK: https://www.kaggle.com/camesruiz/colombia-covid19-complete-dataset

DESCRIPTION: Coronavirus (COVID-19) made its outbreak in Colombia with the first confirmed in the country on march 6th, since then, number of confirmed cases has been increasing and deaths related to the virus are starting to have the first confirmed cases. This data set contains complete information about confirmed cases, deaths and number of recovered patients according to the daily reports by the colombian health department (Ministerio de Salud)

GOAL: Build a model for the number of confirmed cases in the different Colombia regions. You have the access to some covariates, such as: Edad (age), Sexo (Sex), Pais de procedencia (origin country) of the individual cases. Try to test the predictive accuracy of your selected model.

We decided to do central Colombia because it is where the capital is.

The largest cities in the country are Bogotá (in the center), Medellín (in the north, close to central), Cali (in the center) and Barranquilla (extreme north).

Dataset - First Exploration

colombia_covid <- read_csv("data/colombia_covid.csv")
unique(colombia_covid$`Departamento o Distrito`)
##  [1] "Bogotá D.C."           "Valle del Cauca"       "Antioquia"            
##  [4] "Cartagena D.T. y C"    "Huila"                 "Meta"                 
##  [7] "Risaralda"             "Norte de Santander"    "Caldas"               
## [10] "Cundinamarca"          "Barranquilla D.E."     "Santander"            
## [13] "Quindío"               "Tolima"                "Cauca"                
## [16] "Santa Marta D.T. y C." "Cesar"                 "San Andrés"           
## [19] "Casanare"              "Nariño"                "Atlántico"            
## [22] "Boyacá"                "Córdoba"               "Bolívar"              
## [25] "Sucre"                 "La Guajira"
#options(tibble.print_max = Inf) # to show all the elements of the list, but set it also for other chunks
#options(tibble.width = Inf)
colombia_covid %>%
  group_by(`Departamento o Distrito`) %>%
  count()
## # A tibble: 26 x 2
## # Groups:   Departamento o Distrito [26]
##    `Departamento o Distrito`     n
##    <chr>                     <int>
##  1 Antioquia                   127
##  2 Atlántico                     4
##  3 Barranquilla D.E.            31
##  4 Bogotá D.C.                 542
##  5 Bolívar                       3
##  6 Boyacá                        6
##  7 Caldas                       16
##  8 Cartagena D.T. y C           39
##  9 Casanare                      2
## 10 Cauca                        12
## # … with 16 more rows

Colombia is divided into 32 departments. According to Wikipedia we miss the Departments of Amazonas, Arauca, Caquetá, Chocó, Guainía, Guaviare, Magdalena, Putumayo, Vaupés, Vichada.

Bogotá, Distrito Capital in in the Cundinamarca Department. Barranquilla D.E. is a “Distrito Especial” but should be in the Atlántico Department.

The Districts (Spanish: Distrito) in Colombia are cities that have a feature that highlights them, such as its location and trade, history or tourism. Arguably, the districts are special municipalities. The districts are Bogotá, Barranquilla, Cartagena, Santa Marta, Cúcuta, Popayán, Tunja, Buenaventura, Turbo and Tumaco.

We miss Cúcuta, Popayán, Tunjaa, Buenaventura, Turbo and Tumaco.

#lat-long
#bogota<c(4.592164298, -74.072166378, 542)
valle_de_cauca<-c("Valle del Cauca", 3.4372200, -76.5225000, 150)
cauca<-c("Cauca", 2.43823, -76.6131592, 12) 
antioquia<-c("Antioquia",6.2518400, -75.5635900, 127)
cartagena<-c("Cartagena D.T. y C", 10.39972, -75.51444, 39)
huila<-c("Huila", 2.9273, -75.2818909, 30)
meta<-c("Meta", 4.1420000, -73.6266400, 12)
risaralda<-c("Risaralda", 4.8133302, -75.6961136, 35)
norte_santander<-c("Norte de Santander", 7.8939100, -72.5078200, 21)
caldas<-c("Caldas", 5.0688900, -75.5173800, 16)
cudinamarca<-c("Cundinamarca", 4.862437, -74.058655, 42)
barraquilla<-c("Barranquilla D.E.", 10.9685400, -74.7813200, 35) #atlantico
santader<-c("Santander", 7.1253900, -73.1198000, 12)
quindio<-c("Quindío", 4.535000, -75.675690, 23)
tolima<-c("Tolima", 4.43889, -75.2322235, 14)
santa_marta<-c("Santa Marta D.T. y C.", 11.24079, -74.19904, 12)
cesar<-c("Cesar", 10.4631400, -73.2532200, 16)
san_andres<-c("San Andrés", 12.5847197, -81.7005615, 2)
casanare<-c("Casanare", 5.3377500, -72.3958600, 2)
narino<-c("Nariño", 1.2136100, -77.2811100, 6)
boyaca<-c("Boyacá", 5.767222, -72.940651, 6)
cordoba<-c("Córdoba", 8.7479800, -75.8814300, 2)
bolivar<-c("Bolívar", 10.3997200, -75.5144400, 3)
sucre<-c("Sucre", 9.3047199, -75.3977814, 1)
guajira<-c("La Guajira", 11.5444400, -72.9072200, 1)

gis_data<-data.frame(name="Bogotá D.C.", latitude=4.624335, longitude=-74.063644, cases=542) #bogotà
gis_data$name<-as.character(gis_data$name)
gis_data<-rbind(gis_data, cauca)
gis_data<-rbind(gis_data, valle_de_cauca)
gis_data<-rbind(gis_data, antioquia)
gis_data<-rbind(gis_data, cartagena)
gis_data<-rbind(gis_data, huila)
gis_data<-rbind(gis_data, meta)
gis_data<-rbind(gis_data, risaralda)
gis_data<-rbind(gis_data, norte_santander)
gis_data<-rbind(gis_data, caldas)
gis_data<-rbind(gis_data, cudinamarca)
gis_data<-rbind(gis_data, barraquilla)
gis_data<-rbind(gis_data, santader)
gis_data<-rbind(gis_data, quindio)
gis_data<-rbind(gis_data, tolima)
gis_data<-rbind(gis_data, santa_marta)
gis_data<-rbind(gis_data, cesar)
gis_data<-rbind(gis_data, san_andres)
gis_data<-rbind(gis_data, casanare)
gis_data<-rbind(gis_data, narino)
gis_data<-rbind(gis_data, boyaca)
gis_data<-rbind(gis_data, cordoba)
gis_data<-rbind(gis_data, bolivar)
gis_data<-rbind(gis_data, sucre)
gis_data<-rbind(gis_data, guajira)

gis_data$latitude<-as.numeric(gis_data$latitude)
gis_data$longitude<-as.numeric(gis_data$longitude)
gis_data$cases<-as.numeric(gis_data$cases)
getColor <- function(gis_data) {
  sapply(gis_data$cases, function(cases) {
  if(cases <= 10) {
    "green"
  } else if(cases <= 100) {
    "orange"
  } else {
    "red"
  } })
}

icons <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = getColor(gis_data)
)

dept<-geojsonio::geojson_read("data/Colombia.geo.json", what="sp")

labels <- sprintf(
  "<strong>%s</strong><br/>",
  dept$NOMBRE_DPT
) %>% lapply(htmltools::HTML)

m <- leaflet(data=gis_data) %>% addTiles() %>%
  addAwesomeMarkers(~longitude, ~latitude, icon=icons, label = ~as.character(cases)) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(data=dept,     
              opacity = 0.2,
              color = "yellow",
              dashArray = "3",
              fillOpacity = 0.1,
              highlight = highlightOptions(
                weight = 5,
                color = "#666",
                dashArray = "",
                fillOpacity = 0.7,
                bringToFront = TRUE),
                label = labels)
m
#saveWidget(m, file="figs/colombia.html")

Reading here and there I found that Colombia in divided in 5 regions, the central one comprises: Boyacá, Tolima, Cundinamarca, Meta, Bogotà DC.

ANGELA: Seeing Wikipedia I think that the Orinoquía Region (Meta, Arauca, Casanare and Vichada Departments) is in the center, so I would add also Arauca, Casanare and Vichada.

GAIA: added Casanare, for the others we have no data!

However, since in our assignment Colombia is divided in 3 parts, I think that we should add some more regions (e.g. Quindío, Valle del Cauca, Risaralda, Celdas, Boyacá and possibly Antioquia and Santander)

#slice the main dataset
central.colombia.dep<-c("Bogotá D.C.", "Tolima", "Cundinamarca", "Meta", "Boyacá", "Quindío", "Cauca", "Valle del Cauca", "Risaralda", "Caldas", "Boyacá", "Antioquia", "Santander", "Casanare")
central.colombia.rows<-which(colombia_covid$`Departamento o Distrito` %in% central.colombia.dep)
colombia_covid<-colombia_covid[central.colombia.rows,]

#slice the gis_data dataset
central_gis_data<-gis_data[which(gis_data$name %in% central.colombia.dep),]
#slice the geojson dataset 
central_dept<-c("SANTAFE DE BOGOTA D.C", "TOLIMA", "CUNDINAMARCA", "META", "BOYACA", "QUINDIO", "CAUCA", "VALLE DEL CAUCA", "RISARALDA", "CALDAS", "BOYACA", "ANTIOQUIA", "SANTANDER", "CASANARE")
central.dept<-dept[which(dept@data$NOMBRE_DPT %in% central_dept),]
getColor <- function(central_gis_data) {
  sapply(central_gis_data$cases, function(cases) {
  if(cases <= 10) {
    "green"
  } else if(cases <= 100) {
    "orange"
  } else {
    "red"
  } })
}

icons <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = getColor(central_gis_data)
)
#now colombia is yellow and our departments are red
central_map <- leaflet(data=central_gis_data) %>% addTiles() %>%
  addAwesomeMarkers(~longitude, ~latitude, icon=icons, label = ~htmlEscape(paste(name,cases,sep=" : "))) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(data=dept,
              opacity=0.1,
              color="yellow",
              fillOpacity = 0.1) %>%
  addPolygons(data=central.dept,     
              opacity = 0.2,
              color = "red",
              dashArray = "3",
              fillOpacity = 0.1)
central_map
#saveWidget(central_map, file="figs/central_colombia.html")

Some very basics plots

Let’s check the situation (and also the power of ggplot)!

Scattered infos about pandemic in Colombia (https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Colombia):

  • the quarantine started on the 20th of March, since our data are from 6th of March to 2nd of April, it is very likeliy that quarantine effects are not witnessed in our data.

  • on March the 26th there was a damage in the machine that prepared the samples for processing and subsequent diagnosis of COVID-19, which affected the speed at which results were being produced. This could explain the very low number of confirmed cases.

theme_set(theme_classic())
region<-colombia_covid$`Departamento o Distrito`
nrows<-10
df <- expand.grid(y = 1:nrows, x = 1:nrows)
categ_table <- round(table(region) * ((nrows*nrows+1)/(length(region))))
df$category<-factor(rep(names(categ_table), categ_table))
waffle_chart <- ggplot(df, aes(x = x, y = y, fill = df$category)) + 
        geom_tile(color = "black", size = 0.5) +
        scale_x_continuous(expand = c(0, 0)) +
        scale_y_continuous(expand = c(0, 0), trans = 'reverse') +
        scale_fill_brewer(palette = "Set3") +
        labs(title="Frequency of cases across Departments", subtitle="Waffle Chart") + 
        theme(#panel.border = element_rect(size = 2),
              plot.title = element_text(size = rel(1.2)),
              axis.text = element_blank(),
              axis.title = element_blank(),
              axis.ticks = element_blank(),
              legend.title = element_blank(),
              legend.position = "right")
waffle_chart

The major number of cases are in the capital Bogotà.

theme_set(theme_classic())
#number of cases confirmed day by day

#fix day column in "international" format so that R can fix the sorting properly
colombia_covid$`Fecha de diagnóstico`<-as.Date(colombia_covid$`Fecha de diagnóstico`, format="%d/%m/%Y")

colorCount<-length(unique(colombia_covid$`Departamento o Distrito`)) 
getPalette<-colorRampPalette(brewer.pal(9, "Spectral"))(colorCount)

daily_confirmed<-ggplot(colombia_covid, aes(x = `Fecha de diagnóstico`)) +
  scale_fill_manual(values = getPalette)
daily_confirmed + geom_histogram(aes(fill=`Departamento o Distrito`), width = 0.8, stat="count") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6),
        legend.position = "right") +
  labs(title = "Daily number of confirmed cases", 
       subtitle = "subdivided across departments",
       x = "Date of confirmation",
       fill = "Department")

The previous plot represents the daily incidence of the desease across all the departments we are taking into account.

Let’s check the general trend by looking at the cumulative number of confirmed cases (again, all “our” departments are taken into account):

#the max of the ID de caso for each date is the cumulative number of cases confirmed up to that date
cumulative<- as.data.frame(colombia_covid %>%
  group_by(`Fecha de diagnóstico`) %>%
  summarise(max(`ID de caso`)))
  
names(cumulative)<-c("Fecha de diagnóstico", "Cumulative confirmed")
cumulative
##    Fecha de diagnóstico Cumulative confirmed
## 1            2020-03-06                    1
## 2            2020-03-09                    3
## 3            2020-03-11                    9
## 4            2020-03-12                   11
## 5            2020-03-13                   16
## 6            2020-03-14                   24
## 7            2020-03-15                   45
## 8            2020-03-16                   57
## 9            2020-03-17                   75
## 10           2020-03-18                  102
## 11           2020-03-19                  128
## 12           2020-03-20                  175
## 13           2020-03-21                  210
## 14           2020-03-22                  240
## 15           2020-03-23                  306
## 16           2020-03-24                  419
## 17           2020-03-25                  481
## 18           2020-03-26                  491
## 19           2020-03-27                  539
## 20           2020-03-28                  603
## 21           2020-03-29                  700
## 22           2020-03-30                  798
## 23           2020-03-31                  906
## 24           2020-04-01                 1065
## 25           2020-04-02                 1161
cumulative_confiremd<-ggplot(cumulative, aes(x=`Fecha de diagnóstico`, y=`Cumulative confirmed`))

cumulative_confiremd + geom_point(size=3) +
  geom_segment(aes(x=`Fecha de diagnóstico`,
                   xend=`Fecha de diagnóstico`,
                   y=0,
                   yend=`Cumulative confirmed`)) +
  labs(title = "Cumulative number of confirmed cases",
       x = "Date of confirmation") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

Here the growth seems exponential (and this is consistent with the fact that we are studying the early stages of the outbreak).

In order to confirm it we should fit a log-linear model, and check that it produces a constant growth rate (straight line).

Now let’s explore the distribution of cases across genders and age:

library(ggthemes)
brks <- seq(-250, 250, 50)
lbls <- as.character(c(seq(-250, 0, 50), seq(50, 250, 50)))

ggplot(data=colombia_covid, aes(x=`Departamento o Distrito`, fill = Sexo)) +  
                              geom_bar(data = subset(colombia_covid, Sexo == "F")) +
                              geom_bar(data = subset(colombia_covid, Sexo == "M"), aes(y=..count..*(-1))) + 
                              scale_y_continuous(breaks = brks,
                                               labels = lbls) + 
                              coord_flip() +  
                              labs(title="Spread of the desease across genders",
                                   y = "Number of cases",
                                   x = "Department",
                                   fill = "Gender") +
                              theme_tufte() +  
                              theme(plot.title = element_text(hjust = .5), 
                                    axis.ticks = element_blank()) +   
                              scale_fill_brewer(palette = "Dark3")  

Maybe in order to study the distribution of ages we should divide the ages in groups, for example 0-18, 18-30, 30-45, 45-60, 60-75, 75+.

#create column age_group (didn't modify the original dataset though)
age_groups<-colombia_covid %>% mutate(age_group = case_when(Edad <= 18 ~ '0-18',
                                             Edad >= 19  & Edad <= 30 ~ '19-30',
                                             Edad >=  31 & Edad <= 45 ~ '31-45',
                                             Edad >= 46 & Edad <= 60 ~ '46-60',
                                             Edad >=61 & Edad <= 75 ~ '60-75',
                                             Edad >=76 ~ '76+'))

#compute percentage so that we can label more precisely the pie chart
age_groups_pie <- age_groups %>% 
  group_by(age_group) %>%
  count() %>%
  ungroup() %>%
  mutate(per=`n`/sum(`n`)) %>% 
  arrange(desc(age_group))
age_groups_pie$label <- scales::percent(age_groups_pie$per)

library(ggrepel)

age_pie <- ggplot(age_groups_pie, aes(x = "", y = per, fill = factor(age_group))) + 
  geom_bar(stat="identity", width = 1) +
  theme(axis.line = element_blank(), 
        plot.title = element_text(hjust=0.5)) + 
  labs(fill="Age groups", 
       x=NULL, 
       y=NULL, 
       title="Distribution of the desease across ages") +
  coord_polar(theta = "y") +
  #geom_text(aes(x=1, y = cumsum(per) - per/2, label=label))
  geom_label_repel(aes(x=1, y=cumsum(per) - per/2, label=label), size=3, show.legend = F, nudge_x = 0) +
  guides(fill = guide_legend(title = "Group"))
  
age_pie 

This is quite surprising.. I expected elder people to be more affected by Covid-19!

The overall life expectancy in Colombia at birth is 74.8 years (71.2 years for males and 78.4 years for females). Wikipedia

Instead, the median age of the population in 2015 was 29.5 years (30.4 in 2018, 31.3 in 2020), so it is a Country full of young people! link or link or link

Now we can analyze jointly the distribution of age and sex (sex distribution across group of age):

#library(ggpol)

keep <- c("Sexo", "age_group")
age_groups<-age_groups[names(age_groups)%in%keep]
age_groups_count<-aggregate(cbind(pop=Sexo)~age_group+Sexo,
                      data=age_groups,
                      FUN = function(x){NROW(x)})
age_groups_count$count <- ifelse(age_groups_count$Sexo == "F", age_groups_count$pop * -1, age_groups_count$pop)
age_groups_count<-as.data.frame(age_groups_count[names(age_groups_count)!="pop"])

ggplot(age_groups_count, aes(x=age_group, y=count, fill=Sexo)) +
  geom_col() + 
  #facet_share(~Sexo, dir = "h") +
  coord_flip(clip="off") +
  theme_minimal() +
  labs(title = "Distribution of sex by age",
       y = "",
       x = "Age group")

There isn’t much difference between the sexes among the different group of ages, I have the impression that the covariates present in the dataset won’t help us!! :(

We are now left to explore the Tipo variable:

theme_set(theme_classic())
#renamed the attribute since I had sime problem due to the presence of *
colnames(colombia_covid)[8]<-"tipo"

type<-ggplot(colombia_covid, aes(x = `Fecha de diagnóstico`)) +
  scale_fill_brewer(palette = "Set3")
type + geom_histogram(aes(fill=tipo), width = 0.8, stat="count") + theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(title = "Daily number of confirmed cases", 
       subtitle = "subdivided across type",
       x = "Date of confirmation",
       fill = "Type")

I think that en estudio means that it is not clear while the case is imported or not, however it seems like there are more imported cases, we can count them:

type_pie<- colombia_covid %>% 
  group_by(tipo) %>%
  count() %>%
  ungroup() %>%
  mutate(per=`n`/sum(`n`)) %>% 
  arrange(desc(tipo))
type_pie$label <- scales::percent(type_pie$per)
type_pie<-type_pie[names(type_pie)!="per"]
colnames(type_pie)<-c("tipo", "total_number", "percentage")
type_pie
## # A tibble: 3 x 3
##   tipo        total_number percentage
##   <chr>              <int> <chr>     
## 1 Relacionado          291 29.3%     
## 2 Importado            467 47.0%     
## 3 En estudio           235 23.7%

Almost half of the total confirmed cases in our region of interest are imported, and a significant percentage is anknown wheter it is imported or not. Again this is in some sense interesting, but I don’t see clearly why this should be helpful in our model!

imported<-subset(colombia_covid, tipo=="Importado", select = c("País de procedencia"))
imported %>%
  group_by(`País de procedencia`) %>%
  count() 
## # A tibble: 55 x 2
## # Groups:   País de procedencia [55]
##    `País de procedencia`     n
##    <chr>                 <int>
##  1 0                         1
##  2 Alemania                  4
##  3 Alemania - Estambul       1
##  4 Arabia                    1
##  5 Argentina                 2
##  6 Aruba                     2
##  7 Bélgica                   1
##  8 Brasil                   10
##  9 Canadá                    1
## 10 Chile                     2
## # … with 45 more rows

here data are a bit dirty, however I don’t know if the effort of cleaning them will worth the result.. it depends wheter we decide to use this info in our analysis

Missing

I still didn’t integrate the “other part” of the dataset, the one concerning deaths!

Ideas

For what concerns the predictive model we want to build, I think that we should start by something very simple (e.g. a (log)linear model) and take it as a baseline.

Then we build something more complex (such as a hierarchical model) and see the improvements with respect to the baseline.

If possible I would put inside something bayesian, since I understood that they really like this kind of stuff!